ABSTRACT

Crowdsourced entity extraction is often used to acquire data for 
many applications, including recommendation systems, construction
of aggregated listings and directories, and knowledge base
construction. Current solutions focus on entity extraction using a 
single query, e.g., only using “give me another restaurant”, when 
assembling a list of all restaurants. Due to the cost of human labor, 
solutions that focus on a single query can be highly impractical. 

In this paper, we leverage the fact that entity extraction often
focuses on structured domains, i.e., domains that are described by
a collection of attributes, each potentially exhibiting hierarchical 
structure. Given such a domain, we enable a richer space of queries, 
e.g., “give me another Moroccan restaurant in Manhattan that does 
takeout”. Naturally, enabling a richer space of queries comes with a 
host of issues, especially since many queries return empty answers. 
We develop new statistical tools that enable us to reason about the 
gain of issuing additional queries given little to no information, and 
show how we can exploit the overlaps across the results of queries 
for different points of the data domain to obtain accurate estimates 
of the gain. We cast the problem of budgeted entity extraction over 
large domains as an adaptive optimization problem that seeks to 
maximize the number of extracted entities, while minimizing the 
overall extraction costs. We evaluate our techniques with
experiments on both synthetic and real-world datasets, demonstrating a
yield of up to 4X over competing approaches for the same budget. 

1. INTRODUCTION 

Combining human computation with traditional computation, commonly
referred to as crowdsourcing, has recently proven beneficial in
extracting knowledge and acquiring data for many application
domains, including recommendation systems, knowledge base
completion, entity extraction, and structured data collection. In
fact, extracting information, and entities in particular, from the
crowd has been shown to provide access to more fine-grained
information that may belong to the long tail of the web or even be
completely unavailable on the web.

A fundamental challenge in crowdsourced entity extraction is
reasoning about the completeness of the extracted information. Given
a task, e.g., “extract all restaurants in New York”, that seeks to
extract entities from a specific domain by asking human workers, it
is not easy to judge if we have extracted all entities (in this case
restaurants). This is because we assume an “open world” [9].

Recent work has considered the problem of crowdsourced entity
extraction using a single type of query that is asked to humans; for
our restaurant case, the query will be “give me another restaurant
in New York”. That work determines how many times this query must be
asked to different human workers before we are sure we have
extracted most of the restaurants in New York. However, given the
monetary cost inherent in leveraging crowdsourcing, it is easy to
see that just using this query repeatedly will not be practical for
real-world applications, for two coupled reasons: (a) wasted cost:
we will keep receiving the most popular restaurants and will have to
issue many additional queries before receiving new or unseen
restaurants, thus increasing the cost; (b) lack of coverage: beyond
a point all the restaurants we get will already be present in our
set of extracted entities; thus, we may never end up receiving less
popular restaurants at all.

In this paper, our goal is to make crowdsourced entity extraction 
practical. To do so, we focus on entity extraction over structured 
domains, i.e., a domain that can be fully described by a collection 
of attributes, each potentially being hierarchically structured. For 
example, in our restaurant case, we could have one attribute about 
location, one about cuisine, and one about whether the restaurant 
does takeout. Often the structure of domains in practical
applications is already known by design. We can then leverage this
structure to use a much richer space of queries asked to human 
workers, considering all combinations of values for each of these 
attributes, e.g., “give me another Moroccan restaurant in
Manhattan, New York, that does takeout”. In this manner, we can leverage
these specific, targeted queries to diversify entity extraction and 
obtain not-so-popular entities as well. 

If we view the structured data domain as a partially ordered set
(poset), then each query can be mapped to a node in the graph
describing its topology. Thus, our goal is to traverse the graph
corresponding to the input poset by issuing queries corresponding to
various nodes, often multiple times at each node. However, the poset
describing the domain can often be large, leading to many additional
challenges in deciding which queries to issue at any node: (a)
Sparsity: Many of the nodes in the poset are likely to be empty,
i.e., the queries corresponding to those nodes are likely to not have
any answers; avoiding queries at these nodes is essential to keep
monetary cost low. (b) Interrelationships: Many of the nodes in the
poset are “coupled” with one another; for example, the results from
a few queries corresponding to “give me another Moroccan restaurant
in Manhattan, New York” can inform whether issuing queries
corresponding to “give me another Moroccan restaurant in Manhattan,
New York, that does takeout” is useful or not. We elaborate more on
these challenges in Section 1.1 using examples from a real-world
scenario.

Previously proposed techniques do not directly apply to the
scenario where we are traversing a poset corresponding to this
structured data domain, and new techniques are needed. The main
limitation of the aforementioned techniques is that they focus on
estimating the completeness of a specific query and are agnostic to
cost. As a consequence, they do not address the problem of deciding
which additional queries are worth issuing. To mitigate these
shortcomings, one needs to tune the queries that are asked. However,
deciding which queries to ask among a large number of possible
queries (exponential in the number of attributes describing the
input domain), and when and how many times to ask each query, are
both critical challenges that need to be addressed. Furthermore,
unlike previous work, we focus on the budgeted case, where we are
given a budget and we want to maximize the number of retrieved
entities; we believe this is a more practical goal than retrieving
all entities. Our crowdsourced entity extraction techniques can be
useful for a variety of entity extraction applications that are
naturally coupled to a structured domain, including:

• A newspaper that wants to collect a list of today’s events to 
be displayed on the events page every day. In this case, the 
structured data domain could include event type (e.g., music 
concerts vs. political rallies) or location, among other attributes. 

• A stock trading firm wants to collect a list of stocks that have 
been mentioned by popular press on the previous day. In this 
case, the structured data domain could include stock type,
popular press article type, or whether the mention was positive or
negative, among other attributes. 

• A real estate expert wants to curate a list of houses available for 
viewing today. The structured data domain in this case could 
include the price range, the number of floors, etc.

• A university wants to find all the faculty candidates on the job 
market. The structured data domain in this scenario includes 
the university of the applicant, specialization, and whether they 
are Ph.D./Postdoc. 

• The PC chair of a new conference wants to find potential
reviewers. The domain describing each of the candidates can be
characterized by the university or company of the reviewer,
expertise, qualifications, and so on.

1.1 A Real-World Scenario 

To exemplify the aforementioned challenges we review a large-scale
real-world scenario where crowdsourcing is used to extract entities.
We consider Eventbrite (www.eventbrite.com), an online event
aggregator that relies on crowdsourcing to compile a directory of
events with detailed information about the location, type, date and
category of each event. Typically, event aggregators are interested
in collecting information about diverse events, ranging from
conferences and music festivals to political rallies, across
different locations, i.e., countries or cities. In particular,
Eventbrite collects information about events across different
countries in the world. Each country is split into cities and areas
across the country. Moreover, events are organized according to
their type and topic. The attributes and their corresponding
structure are known in advance and are given by the design of the
application. We collected a dataset from Eventbrite spanning 63
countries that are divided into 1,709 subareas (e.g., states) and
10,739 cities, containing events of 19 different types, such as
rallies, tournaments, conferences, conventions, etc., and a time
period of 31 days spanning the months of October and November.

Two of the three dimensions describing the domain of collected
events, i.e., location and time, are hierarchically structured. The
poset characterizing the domain can be fully specified if we
consider the cross product across the possible values for location,
event type and time. For each of the location, time, and type
dimensions we also consider a special wildcard value. Taking the
cross product across the possible values of these dimensions results
in a poset with a total of 8,508,160 nodes containing 57,805
distinct events overall. We point out that the events associated
with a node in the poset overlap with the events corresponding to
its descendants. First, we demonstrate how the sparsity challenge
applies to Eventbrite.

Figure 1: The population of different nodes in the Eventbrite domain.
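
As a sanity check on the size of the poset, the short sketch below reproduces the reported node count from the reported dimension sizes. The split of the time dimension into 31 days plus two month-level values is an assumption made here for illustration; the text only states the total and that time is hierarchical.

```python
# Sizes reported for the Eventbrite dataset.
countries, subareas, cities = 63, 1_709, 10_739
event_types = 19
days = 31

# ASSUMPTION (not stated in the text): the time hierarchy also has two
# month-level values, October and November, above the 31 days.
months = 2
WILDCARD = 1  # one special wildcard value per dimension

location_values = countries + subareas + cities + WILDCARD  # 12,512
type_values = event_types + WILDCARD                         # 20
time_values = days + months + WILDCARD                       # 34

total_nodes = location_values * type_values * time_values
print(total_nodes)  # 8508160
```

Under this assumed decomposition the product matches the 8,508,160 nodes reported above.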

Example 1. We plot the number of events for each node in the
poset describing the Eventbrite domain. Out of 8,508,160 nodes,
only 175,068 nodes are associated with events, while the remaining
nodes have zero events. Figure 1 shows the number of events per node
(the y-axis is in log scale). Most of the populated nodes have fewer
than 100 events. Additionally, the most populated nodes of the domain
correspond to nodes at the higher levels of the poset. When
extracting events from such a sparse domain, one needs to carefully
decide on the crowdsourced queries to be issued, especially if
operating under a monetary budget.

As mentioned before, a critical challenge in such large domains
is deciding on the queries to ask. However, the hierarchical
structure of the data domain presents us with an opportunity. One
approach would be to perform a top-down traversal of the poset and
issue queries at the different nodes. Nevertheless, this gives rise
to a series of challenges: (i) how can one decide on the number of
queries to be asked at each node, (ii) when should one progress to
deeper levels of the poset, and (iii) which subareas should be
explored. We elaborate on these in Section 2. Next, we focus on the
second challenge, i.e., the interdependencies across poset nodes.

Example 2. We consider again the Eventbrite dataset and plot
the pairwise overlaps of the ten most populous nodes in the domain.
Figure 2 shows the Jaccard index for the corresponding node pairs.
As shown, the event populations corresponding to these nodes overlap
significantly. It is easy to see that when issuing queries at a
certain domain node, we obtain events corresponding not only to
this node but to other nodes in the domain as well.

A critical issue that stems from the overlaps across nodes is being
able to decide how many answers to expect when issuing an additional
query at a node whose underlying population overlaps with
nodes associated with previous queries. In Section [J] we elaborate 
more on the dependencies across nodes of the poset. 

1.2 Contributions 

Motivated by the examples above, we study the problem of entity 
extraction over structured domains. More precisely, we focus on 
domains described by a collection of attributes, each following a 
known hierarchical structure, i.e., we assume that for each attribute 
the corresponding hierarchy is known. Such hierarchies are usually 
dictated by the design of applications. Moreover, as controlling 
the overall extraction cost in large-scale applications is crucial we 
focus on budgeted crowd entity extraction. 

We propose a novel algorithmic framework that exploits the
structure of the domain to maximize the number of extracted entities
under given budget constraints. In particular, we view the problem
of entity extraction as a multi-round adaptive optimization problem.
At each round we exploit the information on extracted entities
obtained by previous queries to adaptively select the crowd query
that will maximize the cost-gain trade-off at that round. The gain
of a query is defined as the number of new unique entities extracted.

Figure 2: Pairwise overlaps for the 10 most populous nodes.

We consider generalized queries that ask workers to provide us
with entities from a domain $\mathcal{D}$ and can also include an
exclude list. In general, such queries are of the type “Give me $k$
more entities with attributes $X$ that belong in domain
$\mathcal{D}$ and are not in $\{A, B, \dots\}$”. Extending
techniques from the species estimation literature and building upon
the multi-armed bandits literature, we introduce a new methodology
for estimating the gain for such generalized queries and show how
the hierarchical structure of the domain can be exploited to
increase the number of extracted entities. Our main contributions
are as follows:

• We study the challenge of information flow across entity
extraction queries for overlapping parts of the data domain.

• We formalize the notion of an exclude list for crowdsourced
entity extraction queries and show how previously proposed gain
estimators can be extended to handle such queries.

• We develop a new technique to estimate the gain of generalized
entity extraction queries under the presence of little
information, i.e., only when a small portion of the underlying
entity population has been observed. We empirically demonstrate its
effectiveness when extracting entities from sparse domains.

• We introduce an adaptive optimization algorithm that takes as
input the gain estimates for different types of queries and
identifies querying policies that maximize the total number of
retrieved entities under given budget constraints.

• Finally, we show that our techniques can effectively solve the
problem of budgeted crowd entity extraction for large data
domains on both real-world and synthetic data.

2. PRELIMINARIES 

In this section we first define structured domains, then describe 
entities and entity extraction queries or interfaces, along with the 
response and cost model for these queries. Then, we define the 
problem of crowd entity extraction over structured domains that 
seeks to maximize the number of extracted entities under budget 
constraints and present an overview of our proposed framework. 

2.1 Structured Data Domain 

Let $\mathcal{D}$ be a data domain described by a set of discrete
attributes $\mathcal{A}_D = \{A_1, A_2, \dots, A_d\}$. Let
$dom(A_i)$ denote the domain of each attribute $A_i \in
\mathcal{A}_D$. We focus on domains where each attribute $A_i$ is
hierarchically organized. For example, consider the Eventbrite
domain introduced in Section 1.1. The data domain $\mathcal{D}$
corresponds to all events and the attributes describing the entities
in $\mathcal{D}$ are $\mathcal{A}_D$ = {“Event Type”, “Location”,
“Date”}. Figure 3 shows the hierarchical organization of each
attribute.


[Figure 3 image: Eventbrite Event Data Domain]

Figure 3: The attributes describing the Eventbrite domain and the
hierarchical structure of each attribute.

The domain $\mathcal{D}$ can be viewed as a poset, i.e., a partially
ordered set, corresponding to the cross-product of all available
hierarchies.^1 Part of the poset corresponding to the previous
example is shown in Figure 4. We denote this cross-product as
$\mathcal{H}_D$. As can be seen in Figure 4, there are nodes, such
as {}, where no attributes are specified, and nodes, such as {X1}
and {C1}, where just one of the attribute values is specified, as
well as nodes, such as {X2, ST2}, where multiple attribute values
are specified.



Figure 4: Part of the poset defining the entity domain for Eventbrite. 
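
The cross-product construction can be sketched in a few lines. The snippet below is a minimal illustration on a toy domain, not the construction used in the paper: the two hierarchies, their values (X1, X2, C1, ST1, ST2), the `"*"` wildcard encoding, and the helper `children` are all assumptions chosen to mirror the node labels in Figure 4.

```python
from itertools import product

# Toy hierarchies with hypothetical values: child -> parent; "*" is the wildcard root.
EVENT_TYPE = {"*": None, "X1": "*", "X2": "*"}
LOCATION = {"*": None, "C1": "*", "ST1": "C1", "ST2": "C1"}
HIERARCHIES = (EVENT_TYPE, LOCATION)

# Every poset node picks one value (possibly the wildcard) per attribute.
nodes = list(product(*HIERARCHIES))

def children(node):
    """Nodes reached by specializing exactly one attribute by one level."""
    result = []
    for i, hierarchy in enumerate(HIERARCHIES):
        for value, parent in hierarchy.items():
            if parent == node[i]:
                result.append(node[:i] + (value,) + node[i + 1:])
    return result

root = ("*", "*")          # the node {} where no attribute is specified
print(len(nodes))          # 3 * 4 = 12 nodes in this toy poset
print(children(root))      # [('X1', '*'), ('X2', '*'), ('*', 'C1')]
```

A node such as `("X1", "ST2")` has two parents here, `("X1", "C1")` and `("*", "ST2")`, which is why the structure is a poset rather than a tree.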

2.2 Entities and Entity Extraction Queries 

Entities. Our goal is to extract entities that belong to the domain
$\mathcal{D}$. We assume that each entity $e$ can be uniquely
associated with one of the leaf nodes in the hierarchy
$\mathcal{H}_D$; that is, there is a unique set of “most-specific”
values of $A_1, \dots, A_d$ for every entity. For example, in
Eventbrite, each entity (here, a local event) takes place in a
specific city and on a specific day. Our techniques also work for
the case when entities can be associated only with “higher level”
nodes, but we focus on the former case for simplicity.

Queries. Next, we describe queries for extracting entities from the
crowd. First, a query $q$ is issued at a node $v \in
\mathcal{H}_D$; that is, a query specifies zero or more attribute
values from $A_1, \dots, A_d$ that are derived from the
corresponding values of $v$, implicitly requiring the worker to find
entities that match the specified attribute values.

Given a query issued at a node, there are three different
configurations one can use to extract entities from the crowd. The
first configuration corresponds to single entity queries, where
workers are required to provide “one more” entity that matches the
attribute values specified in the query. Considering the Eventbrite
example introduced in the previous section, an example of a single
entity query would be asking a worker to provide “a concert in
Manhattan, New York”. The second configuration corresponds to
queries of size $k$, where workers are asked to provide up to $k$
distinct entities. Finally, the last configuration corresponds to
exclude list queries. Here, workers are additionally provided with a
list $E$ of $l$ entities that have already been extracted and are
required to provide up to $k$ distinct entities that are not present
in the exclude list. It is easy to see that the last configuration
generalizes the previous two: a single entity query is the case
$k = 1$ with an empty exclude list, and a query of size $k$ is the
case of an empty exclude list. Therefore, in the remainder of the
paper, we will only consider queries using the third configuration.
To describe a query, we will use the notation $q(k, E)$, denoting a
query of size $k$ accompanied with an exclude list $E$ of length
$l$. We will denote the configuration characterizing the query as
$(k, l)$.

^1 Note that $\mathcal{D}$ is not a lattice since there is no unique infimum.

Query Response. Given a query $q(k, E)$ issued at a node $v \in
\mathcal{H}_D$, a human worker gives us $k$ distinct entities that
belong to the domain $\mathcal{D}$, match the specified attribute
values mentioned in the query (derived from $v$), and are not
present in $E$. Furthermore, the human worker provides us the
information for the attributes that are not specified in $q$ for
each of the $k$ entities. For example, if our query is “a concert in
Manhattan, New York”, with $k = 1$, $E = \emptyset$, the human
worker gives us one concert in Manhattan, New York, but also gives
us the day on which the concert will take place (here, the missing,
unspecified attribute). If the query is “a concert in the US”, with
$k = 1$, $E = \emptyset$, the human worker gives us one concert in
the US, but also gives the day on which the concert will take place,
as well as the specific city. If fewer than $k$ entities are present
in the underlying population, workers have the flexibility to report
either
an empty answer or a smaller number of entities (Section 

While the reader may wonder if getting additional attributes for
entities is necessary, note that this information allows us to reason
about which nodes in $\mathcal{H}_D$ the entity belongs to; without
this, it is difficult to effectively traverse the poset. Furthermore,
we find that in most practical applications, it is useful to get the
values of the missing attributes to organize and categorize the
extracted entities better. Similar query interfaces that ask users to
fully specify the attributes of entities have been proposed in recent
literature.

Finally, answers are expected to be duplicated across workers,
who may also specify or extract an entity incorrectly. Resolving
duplicate entities during extraction is crucial, as this information
is later used to characterize the completeness of extracted entities
and thus reason about the gain of additional queries. Extraction
errors can be resolved by leveraging the presence of duplicate
information and by applying de-duplication and entity resolution
techniques. At a high level, one can use an entity resolution or
string similarity (e.g., Jaccard coefficient) algorithm to identify
duplicate entities. Furthermore, the additional attributes for each
entity can be used to further ascertain similarity of entities. We
refer the reader to Getoor and Machanavajjhala for an overview of
entity resolution techniques. Finally, standard truth discovery
techniques can be used to identify the correct attribute values for
entities. Nevertheless, entity resolution and truth discovery are
orthogonal problems and not the focus of this paper. In our
experiments on real datasets, we found that there were no cases
where humans introduced errors to the attribute values of extracted
entities. Only minor errors (e.g., misspelled entity names) were
detected and fixed manually.
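
To make the string-similarity option concrete, the following is a minimal sketch of duplicate detection with the Jaccard coefficient over word tokens. It is an illustration only: the tokenization, the greedy single-pass strategy, the threshold of 0.5, and the example event names are all assumptions, and a production system would more likely rely on a full entity resolution pipeline.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient between the lowercase word sets of two names."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def deduplicate(names, threshold=0.5):
    """Greedy pass: keep a name only if it is not similar to an already kept one."""
    kept = []
    for name in names:
        if all(jaccard(name, k) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical worker answers for the same query.
answers = ["Jazz Night at Blue Note", "jazz night blue note", "NYC Marathon"]
print(deduplicate(answers))  # ['Jazz Night at Blue Note', 'NYC Marathon']
```

The first two answers share four of five distinct tokens (similarity 0.8), so the second is treated as a duplicate of the first.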

Query Cost. In a typical crowdsourcing marketplace, tasks have
different costs based on their difficulty. Thus, crowdsourced queries
of different difficulties should also exhibit different costs. We
assume we are provided with a cost function $c(\cdot)$ that obeys
the following properties: (a) given a query with fixed size, its
cost should increase as the size of its exclude list increases, and
(b) given a query with a fixed exclude list size, its cost should
increase as the number of requested answers increases. These costs
are fixed upfront by the interface designer based on the amount of
work involved.
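
A query $q(k, E)$ and a cost function with the two monotonicity properties can be represented as below. This is a sketch under stated assumptions: the class name `Query`, the function `cost`, its linear form, and the numeric rates are hypothetical and chosen only so that cost grows with both the query size and the exclude list size.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Query:
    node: tuple                       # poset node where the query is issued
    k: int                            # number of requested entities
    exclude: frozenset = frozenset()  # exclude list E (empty by default)

    @property
    def config(self):
        """The configuration (k, l), where l is the exclude list length."""
        return (self.k, len(self.exclude))

def cost(query, base=0.05, per_answer=0.02, per_excluded=0.005):
    """Hypothetical linear cost, strictly increasing in both k and l."""
    k, l = query.config
    return base + per_answer * k + per_excluded * l

single = Query(node=("X1", "*"), k=1)                       # single entity query
sized = Query(node=("X1", "*"), k=5)                        # query of size k
excl = Query(node=("X1", "*"), k=5, exclude=frozenset({"e1", "e2"}))
print(single.config, sized.config, excl.config)             # (1, 0) (5, 0) (5, 2)
```

The three instances correspond to the three configurations described above, all expressed through the same exclude list form.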

2.3 Crowdsourced Entity Extraction 

The basic version of crowdsourced entity extraction seeks to
extract entities that belong to $\mathcal{D}$ by simply using
repeated queries at the root node, with $k = 1$, $E = \emptyset$.
When considering large entity domains, one may need to issue a
series of entity extraction queries at multiple nodes in
$\mathcal{H}_D$, often overlapping with each other, so that the
entire domain is covered. Issuing queries at different nodes
helps maximize the coverage across the domain.

We let $\pi$ denote a querying policy, i.e., a chain of queries at
different nodes in $\mathcal{H}_D$. Notice that multiple queries
$q(k, E)$ can be issued at the same node. Let $C(\pi)$ denote the
overall monetary cost of a querying policy $\pi$. We define the gain
of a querying policy $\pi$ to be the total number of unique
entities, denoted by $\mathcal{E}(\pi)$, extracted when following
policy $\pi$. Thus, there is a natural tradeoff between the gain
(i.e., the number of extracted entities) and the cost of policies.

Here, we require that the user only provide a monetary budget
$\tau_c$, imposing a constraint on the total cost of a selected
querying policy, and optimize over all possible querying policies
across different nodes of $\mathcal{H}_D$. Our goal is to identify
the policy that maximizes the number of retrieved entities under the
given budget constraint. More formally, we define the problem of
budgeted crowd entity extraction as follows:

Problem 1 (Budgeted Crowd Entity Extraction).
Let $\mathcal{D}$ be a given entity domain and $\tau_c$ a monetary
budget on the total cost of issued queries. The Budgeted Crowd
Entity Extraction problem seeks to find a querying policy $\pi^*$
using queries $q(k, E)$ over nodes in $\mathcal{H}_D$ that maximizes
the number of unique entities extracted $\mathcal{E}(\pi^*)$ under
the constraint $C(\pi^*) \le \tau_c$.

The optimal policy not only specifies the nodes at which queries 
will be executed but also the size and exclude list of each query. 
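
Problem 1 can also be written compactly as a constrained optimization. In the restatement below, $\pi$ is a querying policy, $\mathcal{E}(\pi)$ its gain, $C(\pi)$ its cost, and $\tau_c$ the budget; the symbol $\Pi$ for the set of all querying policies over the poset is notation introduced here for illustration.

```latex
\pi^{*} \;=\; \operatorname*{arg\,max}_{\pi \in \Pi} \; \mathcal{E}(\pi)
\quad \text{subject to} \quad
C(\pi) \;=\; \sum_{q(k,E) \in \pi} c\bigl(q(k,E)\bigr) \;\le\; \tau_c .
```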

The cost of a querying policy $\pi$ is defined as the total cost of
all queries issued by following $\pi$. We have that $C(\pi) =
\sum_{q \in \pi} c(q)$, where the cost of each query $q$ is defined
according to a cost model specified by the user. Computing the total
cost of a policy $\pi$ is easy. However, the gain $\mathcal{E}(\pi)$
of a policy $\pi$ is unknown, as we do not know in advance the
entities corresponding to each node in $\mathcal{H}_D$, and hence
needs to be estimated, as we discuss next.

The problem of budgeted crowd entity extraction is an instance 
of a generalization of the stochastic knapsack problem,
where each item has a deterministic cost (weight) but a stochastic 
profit. The stochastic knapsack problem is known to be NP-hard 
and so is the budgeted crowd entity extraction problem. 

2.4 Underlying Query Response Model 

To reason about the occurrence of entities as response to specific 
queries, we need an underlying query response model. Our query 
response model is based on the notion of popularity. 

Popularities. We assume that each underlying entity has a fixed,
unknown popularity value with respect to crowd workers. Given a
query $q(1, \emptyset)$, asking for one entity without using an
exclude list, the probability that we will get an entity $e$ that
satisfies the constraints specified by $q$ is the popularity value
of $e$ divided by the sum of the popularity values of all entities
$e'$ that also satisfy the constraints in $q$. As an example, if
there are only two entities $e_1$, $e_2$ that satisfy the
constraints specified by a given query $q$, with popularity values 3
and 2, then the probability that we get $e_1$ on issuing a query
$q(1, \emptyset)$ is 3/5. If an exclude list $E$ is specified, then
the probability that we will get an entity $e \notin E$ is the
popularity value of $e$ divided by the sum of the popularity values
of all entities $e' \notin E$ also satisfying the constraints
specified by $q$. We do not assume that all workers follow the same
popularity distribution. Rather, the overall popularity distribution
can be seen as an average of the popularity distributions across all
workers.
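
The response model can be simulated directly. The sketch below computes the answer probabilities for a query with an optional exclude list and reproduces the 3/5 example. The function names are hypothetical, and drawing the $k$ answers one at a time without replacement is an assumption made here, since the text only defines the single-answer probability.

```python
import random

def answer_probabilities(popularity, exclude=frozenset()):
    """Probability of each eligible entity: popularity over the eligible total."""
    eligible = {e: p for e, p in popularity.items() if e not in exclude}
    total = sum(eligible.values())
    return {e: p / total for e, p in eligible.items()}

def simulate_query(popularity, k=1, exclude=frozenset(), rng=random):
    """Draw up to k distinct entities, one at a time, proportional to popularity.

    ASSUMPTION: answers within one query are sampled without replacement.
    """
    excluded = set(exclude)
    answers = []
    for _ in range(k):
        probs = answer_probabilities(popularity, excluded)
        if not probs:  # population exhausted: fewer than k answers
            break
        entities = list(probs)
        pick = rng.choices(entities, weights=[probs[e] for e in entities])[0]
        answers.append(pick)
        excluded.add(pick)
    return answers

popularity = {"e1": 3, "e2": 2}
print(answer_probabilities(popularity))                   # {'e1': 0.6, 'e2': 0.4}
print(answer_probabilities(popularity, exclude={"e1"}))   # {'e2': 1.0}
```

With `e1` on the exclude list, all probability mass moves to `e2`, which is how an exclude list steers workers toward less popular entities.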

Thus, since workers are asked to provide a limited number of
entities as response to a query, each entity extraction query can be
viewed as taking a random sample from an unknown population of
entities. In the rest of the paper, we will refer to the distribution
characterizing the popularities of entities in a population of
entities as the popularity distribution of the population. We note
that this is equivalent to the underlying assumption in the species
estimation literature (Section 3).

Then, estimating the gain of a query $q(k, E)$ at a node $v \in
\mathcal{H}_D$ is equivalent to estimating the number of new
entities extracted by taking additional samples from the population
of $v$ given all the retrieved entities by past samples associated
with node $v$ [36].

Samples for a Node. When extracting entities, the retrieved
entities for a node $v$ (i.e., the running sample) may correspond to
two different kinds of samples: (i) those that were extracted by
considering the entire population corresponding to node $v$, and
(ii) those that we obtained by sampling only a part of the
population corresponding to $v$. Samples for a node $v$ can be
obtained either by querying node $v$ or by indirect information
flowing to $v$ by queries at other nodes. We refer to the latter
case as dependencies across queries.

[Figure 5 diagram: querying node {EventType X1}]

Figure 5: An example query that extracts an entity sample from the red node.
The nodes marked with green correspond to the nodes for which indirect
entity samples are retrieved.

We use an example considering the poset in Figure 4 to illustrate
these two cases. The example is shown in Figure 5. Assume a query
$q(k, \emptyset)$ issued against node {EventType X1}. Assume that
the query result contains entities that correspond only to node
{X1, ST2}. The green nodes in Figure 5 are nodes for which samples
are obtained indirectly without querying them. Notice that all these
nodes are ancestors of {X1, ST2}. Analyzing the samples for the
different nodes we have:

• The samples corresponding to nodes {X1, C1} and {X1, ST2}
were obtained by considering their entire population. The
reason is that node {EventType X1} is an ancestor of both and the
entity population corresponding to it fully contains the
populations of both {X1, C1} and {X1, ST2}.

• The samples corresponding to nodes { }, {Country C1} and
{State ST2} were obtained by considering only part of their
population. The reason is that the population of node
{EventType X1} does not fully contain the populations of these nodes.

Samples belonging to both types need to be considered when
estimating the gain of a query at a node $v \in \mathcal{H}_D$. To
address this issue, we merge the extracted entities for each node in
$\mathcal{H}_D$ into a single sample and treat the unified sample as
being extracted from the entire underlying population of the node.
As we discuss later in Section 4, we develop querying strategies
that traverse the poset $\mathcal{H}_D$ in a top-down approach;
hence, the number of samples belonging to the first category, i.e.,
samples retrieved considering the entire population of a node,
dominates the number of samples retrieved by considering only part
of a node’s population. Moreover, it has been shown by Hortal et al.
that several of the techniques that can be used to estimate the gain
of a query (see Section 3) are insensitive to differences in the way
the samples are aggregated.
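
The distinction between the two kinds of samples in the Figure 5 example can be checked mechanically. The sketch below is an illustration on a toy domain: the hierarchies, the `"*"` wildcard encoding, and the helpers `chain`, `contains`, and `classify_samples` are assumptions. An entity returned for a query is propagated to every ancestor of its leaf node, and a node's sample is labeled "full" when the queried node's population contains that node's population and "partial" otherwise.

```python
from itertools import product

# Toy hierarchies (hypothetical values): child -> parent; "*" is the wildcard root.
EVENT_TYPE = {"*": None, "X1": "*", "X2": "*"}
LOCATION = {"*": None, "C1": "*", "ST1": "C1", "ST2": "C1"}
HIERARCHIES = (EVENT_TYPE, LOCATION)

def chain(hierarchy, value):
    """Return `value` followed by all of its ancestors up to the wildcard."""
    out = []
    while value is not None:
        out.append(value)
        value = hierarchy[value]
    return out

def contains(u, v):
    """True if node u is an ancestor of (or equal to) node v in the poset."""
    return all(u_i in chain(h, v_i) for h, u_i, v_i in zip(HIERARCHIES, u, v))

def classify_samples(queried, entity_leaf):
    """Label each node that receives the entity as a 'full' or 'partial' sample."""
    ancestors = product(*(chain(h, x) for h, x in zip(HIERARCHIES, entity_leaf)))
    return {node: "full" if contains(queried, node) else "partial" for node in ancestors}

# Query at {EventType X1}; the returned entity lives at leaf {X1, ST2}.
labels = classify_samples(queried=("X1", "*"), entity_leaf=("X1", "ST2"))
for node, kind in labels.items():
    print(node, kind)
```

On this toy poset the nodes {X1, C1} and {X1, ST2} come out as "full", while { }, {C1}, and {ST2} come out as "partial", matching the two bullet points above.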


[Figure 6 diagram: iterate the following two steps until no budget
is left. (1) Estimate the gain for each candidate poset node: use
the retrieved entities and estimate the number of new entities to be
extracted for different query sizes $k$ and different exclude list
sizes $l$. (2) Using the gain estimates as input, select the optimal
poset node, query size $k$, and exclude list size $l$, and execute a
new crowd entity extraction query.]

Figure 6: Framework overview for budgeted entity extraction.


2.5 Framework Overview 

We view the optimization problem described in Section 2.3 as
a multi-round adaptive optimization problem where at each round 
we solve the following subproblems: 

• Estimating the Gain for a Query. For each node $v \in
\mathcal{H}_D$, consider the retrieved entities associated with $v$
and estimate the number of new unique entities that will be
retrieved if a new query $q(k, E)$ is issued at $v$. This needs to
be repeated for different query configurations.

• Detecting the Optimal Querying Policy. Using the gain
estimates from the previous problem as input, identify the next
(query configuration, node) combination so that the total gain
across all rounds is maximized with respect to the given budget
constraint. When identifying the next query we do not explicitly
optimize for the exclude list to be used. We rather optimize
for the exclude list size $l$. Once the size is selected, the exclude
list is constructed in a randomized fashion. We elaborate more
on this design choice in Section 

Our proposed framework iteratively solves the aforementioned
problems until the entire budget is used. Figure 6 shows a
high-level diagram of our proposed framework.
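
The two-step loop can be sketched as follows. This is a simplified stand-in, not the paper's algorithm: it greedily picks the affordable (node, $k$, $l$) combination with the best estimated gain per unit cost, whereas the actual policy selection is described later in the paper. The function `run_extraction` and its callback parameters `estimate_gain`, `cost`, and `issue_query` are hypothetical, and the exclude list is drawn at random from all extracted entities rather than from those of the chosen node.

```python
import random

def run_extraction(candidates, estimate_gain, cost, issue_query, budget, seed=0):
    """Greedy multi-round loop over (node, k, l) candidates under a budget.

    ASSUMPTION: cost(k, l) is strictly positive, so the loop terminates.
    """
    rng = random.Random(seed)
    extracted = set()
    spent = 0.0
    while True:
        remaining = budget - spent
        affordable = [c for c in candidates if cost(c[1], c[2]) <= remaining]
        if not affordable:
            break
        # Step 1 and 2: estimate gains and select the best gain/cost ratio.
        node, k, l = max(
            affordable,
            key=lambda c: estimate_gain(c[0], c[1], c[2], extracted) / cost(c[1], c[2]),
        )
        # The exclude list of size l is built in a randomized fashion.
        pool = sorted(extracted)
        exclude = set(rng.sample(pool, min(l, len(pool))))
        extracted |= set(issue_query(node, k, exclude))
        spent += cost(k, l)
    return extracted, spent
```

A caller would supply a gain estimator (Section 3), the cost function of Section 2.2, and a function that posts the query to the crowd and returns the de-duplicated answers.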


3. ESTIMATING THE GAIN OF QUERIES 

Previous work has drawn connections between this problem and the
species estimation literature. However, the techniques proposed
therein do not work for queries that specify an exclude list.
Moreover, they rely on the presence of a relatively large sample and
tend to exhibit negative biases, i.e., they underestimate the
expected gain. Negative biases can severely impact entity extraction
over large domains since nodes that contain entities that belong in
the long tail of the popularity distribution may never be queried as
they may be deemed to have zero population. In this section, we
first review the existing methodology for estimating the gain of a
query. Then we discuss how these estimators can be extended to
consider an exclude list. Finally, we propose a new gain estimator
for generalized queries $q(k, E)$ that exhibits lower biases, and
thus improved performance, in the presence of little information
than previous techniques (see Section 5).

3.1 Previous Estimators 

Consider a specific node v ∈ H_D. Prior work only considers
samples retrieved from the entire population associated with v and
does not consider an exclude list. Let Q be the set of all existing
samples retrieved by issuing queries against v without an exclude
list. These samples can be combined into a single sample corresponding
to a multi-set of size n = \sum_{q \in Q} size(q). Let f_i denote the
number of entities that appear i times in this unified sample, and
let f_0 denote the number of unseen entities from the population under
consideration. Finally, let C be the population coverage of the
unified sample, i.e., the fraction of the population covered by the
sample.

A new query q(k, ∅) at node v can be viewed as increasing the
size of the unified sample by k. Prior work used techniques from
species estimation to estimate the expected number of new entities
returned in q(k, ∅). Shen et al. derive an estimator for the
number of new species N_Shen that would be found when the sample
is enlarged by k observations. The approach assumes that unobserved
entities have equal relative popularity. An estimate of the unique
elements found in a sample enlarged by k is given by:

    N_{Shen} = f_0 \left[ 1 - \left( 1 - \frac{1 - C}{f_0} \right)^{k} \right]    (1)

The second factor of Shen's formula corresponds to the probability
that a given unseen entity appears at least once in a query asking for
k more entities. Thus, multiplying this quantity by the number of
unseen entities f_0 gives the expected number of unseen
entities present in the result of a new query q(k, ∅).

The quantities f_0 and C are unknown and thus need to be estimated
from the entities in the running unified sample. The
coverage can be estimated with the Good-Turing estimator
\hat{C} = 1 - f_1/n for the existing retrieved sample. On the other hand,
multiple estimators have been proposed for estimating the number
of unseen entities f_0. Trushkowsky et al. proposed a variation
of an estimator introduced by Chao et al. to estimate f_0. Nevertheless,
the authors argue that the original estimator proposed by
Chao performs similarly to their approach when estimating the
gain of an additional query q(k, ∅). Next, we discuss how one can
estimate the return of a query q(k, E) in the presence of an exclude
list E of size l and potential negative answers.
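As an illustration, the following Python sketch evaluates Equation (1) on a unified sample, using the Good-Turing coverage estimate described above. It is a sketch under stated assumptions rather than the implementation used in the experiments: the function names are hypothetical, and the number of unseen entities f_0 is estimated with the simple Chao1 lower bound f_1^2 / (2 f_2), which stands in for the Chao-style estimators cited in the text.

```python
from collections import Counter


def frequency_counts(sample):
    """Return (n, f) where f[i] is the number of entities seen exactly i times."""
    per_entity = Counter(sample)
    f = Counter(per_entity.values())
    return len(sample), f


def shen_gain(sample, k):
    """Estimate the number of new entities in k additional answers (Equation 1)."""
    n, f = frequency_counts(sample)
    f1, f2 = f.get(1, 0), f.get(2, 0)
    if n == 0 or f1 == 0:
        return 0.0  # no singletons: the sample looks saturated
    # Assumption: Chao1 lower bound as a stand-in estimate of f_0.
    f0 = f1 * f1 / (2.0 * f2) if f2 > 0 else f1 * (f1 - 1) / 2.0
    if f0 == 0:
        return 0.0
    coverage = 1.0 - float(f1) / n  # Good-Turing estimate of C
    return f0 * (1.0 - (1.0 - (1.0 - coverage) / f0) ** k)
```

For a sample such as `["a", "a", "b", "c", "c", "d"]` we have n = 6, f_1 = 2 and f_2 = 2, so the sketch uses f_0 = 1 and a coverage of 2/3.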


3.2 Exclude Lists and Negative Answers 

A query q(k, E) with E ≠ ∅ issued at node v ∈ H_D effectively
limits the sampling to a restricted subset of the entity population
corresponding to node v. To estimate the expected return of such
a query, we need to update the estimates of f_0 and C before applying
Equation (1), by removing the entities in E from the running sample
for node v and updating the frequency counts f_i and sample size
n. This approach requires that the exclude list is known in advance.
We discuss how we construct an exclude list in Section 4.2.
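The update described above can be sketched in a few lines of Python. The function name and the representation of the running sample as a list of entity identifiers are assumptions made for illustration only.

```python
from collections import Counter


def restricted_counts(sample, exclude):
    """Drop excluded entities, then recompute n and the frequency counts f_i."""
    excluded = set(exclude)
    kept = [e for e in sample if e not in excluded]
    # f[i] = number of remaining entities observed exactly i times
    f = Counter(Counter(kept).values())
    return len(kept), dict(f)
```

The returned sample size and frequency counts can then be plugged into any of the estimators of this section in place of the original n and f_i.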

Next, we study the effect of negative answers on estimating the
gain of future queries. It is possible to issue a query at a specific
node v ∈ H_D and receive no entities, i.e., we receive a negative
answer. This is an indication that the underlying entity population
of v is empty. In such a scenario, we set the expected gain of
future queries at v and all its descendants to zero. Another type
of negative answer corresponds to issuing a query at an ancestor
node u of v and receiving no entities for v. In this case, we do not
update our estimates for node v, as entities from other descendants
of u may be more popular than entities associated with v.


3.3 Direct Gain Estimation 

The techniques reviewed in Section 3.1 result in negative bias
when the number of observed entities from a population represents
only a small fraction of the entire population. This holds
for the large and sparse domains we consider in this paper. To address
this problem, Hwang and Shen proposed a regression-based
technique to estimate f_0 and showed that it results in smaller
biases. However, estimating the total gain of a query requires coupling
this new estimator with Equation (1); thus, it may still exhibit
negative bias. To eliminate negative bias, we propose a direct estimator
for the gain of generalized queries q(k, E) that does not use
Equation (1). We build upon the approach of Hwang and Shen and use
a regression-based technique that captures the structural properties of
the expected gain function.

Let S denote the total number of entities in the population under
consideration and p_i the abundance probability (i.e., popularity) of
entity i. Given a sample of size n from the population, define K(n)
to be

    K(n) = \frac{\sum_{i=1}^{S} (1 - p_i)^{n}}{\sum_{i=1}^{S} p_i (1 - p_i)^{n-1}}.

First, we focus on queries without an exclude list. Later we relax
this and discuss queries with exclude lists. We have the following
theorem on query gain:


Theorem 1. Given a node v ∈ H_D and a corresponding entity
sample of size n, let f_1 and f_2 denote the number of entities
that appear exactly once (i.e., singletons) and exactly twice, respectively.
Let G denote the number of new items retrieved by a query
q(m, ∅). We have that:

    G = \frac{1}{1 + \frac{K'}{n+m}} \left( K \frac{f_1}{n} - K' \frac{f_1 \left( 1 - \frac{2 f_2}{n f_1} \right)^{m}}{n+m} \right)    (2)

where K = K(n) and K' = K(n + m).


Proof. To derive the new estimator we make use of the generalized
jackknife procedure for species richness estimation.
Given two (biased) estimators of S, say \hat{S}_1 and \hat{S}_2, let R be the
ratio of their biases:

    R = \frac{E(\hat{S}_1) - S}{E(\hat{S}_2) - S}    (3)

By the generalized jackknife procedure, we can completely eliminate
the bias resulting from either \hat{S}_1 or \hat{S}_2 via

    \hat{S} = G(\hat{S}_1, \hat{S}_2) = \frac{\hat{S}_1 - R \hat{S}_2}{1 - R}    (4)

provided the ratio of biases R is known. However, R is unknown
and needs to be estimated.

Let D_n denote the number of unique entities in a unified sample
of size n. We consider the following two biased estimators of
S: \hat{S}_1 = D_n and \hat{S}_2 = \sum_{j=1}^{n} D_{n-1}(j) / n = D_n - f_1/n, where
D_{n-1}(j) is the number of species discovered with the j-th observation
removed from the original sample. Replacing these estimators
in Equation (4) gives us:

    \hat{S} = D_n + \frac{R}{1 - R} \frac{f_1}{n}    (5)

Similarly, for a sample of increased size n + m we have:

    \hat{S} = D_{n+m} + \frac{R'}{1 - R'} \frac{f_1'}{n+m}    (6)

where R' is the ratio of the biases and f_1' the number of singleton
entities for the increased sample. Let K = R/(1 - R) and K' = R'/(1 - R').
Taking the difference of the previous two equations we have:

    D_{n+m} - D_n = K \frac{f_1}{n} - K' \frac{f_1'}{n+m}    (7)

Therefore, we have:

    G = K \frac{f_1}{n} - K' \frac{f_1'}{n+m}    (8)

We need to estimate K, K' and f_1'. We start with f_1', which denotes
the number of singleton entities in the increased sample of size
n + m. Notice that f_1' is not known since we have not obtained
the increased sample yet, so we need to express it in terms of f_1,
i.e., the number of singletons in the running sample of size n. We
have:

    f_1' = G + f_1 - f_1^{*}    (9)













where f_1^{*} denotes the number of old singleton entities from the
sample of size n that appeared in the additional query of size m.
Let E_1 denote the set of singleton entities in the old sample of size
n. We approximate f_1^{*} by its expected value:

    f_1^{*} = \sum_{e \in E_1} \Pr[e \text{ appears in query of size } m]    (10)

We compute the probability of an old singleton entity appearing
in an additional query as follows. Let p_e denote the popularity
of entity e. As described before, an additional query of size m
corresponds to taking a sample of size m from the underlying entity
population without replacement. However, m is significantly
smaller than the size of the underlying population; thus, we
can treat taking a sample of size m as sampling
with replacement. Following this we have that:

    \Pr[e \text{ appears in query of size } m] = 1 - (1 - p_e)^{m}    (11)

Following a standard approach in the species estimation literature,
we assume that the probability of retrieving a singleton entity again
is the same for all singleton entities. This popularity can be computed
using the corresponding Good-Turing estimator on
the running sample. We have:

    \forall e \in E_1: \; p_e = p_1 = \theta(1) = \frac{2 f_2}{n f_1}    (12)

where f_2 is the number of entities that appear twice in the sample
and f_1 is the number of singletons. Eventually we have that:

    f_1^{*} = f_1 \left( 1 - (1 - p_1)^{m} \right)

and

    f_1' = G + f_1 (1 - p_1)^{m}    (13)

Replacing the last equation in Equation (8) we have:

    G = K \frac{f_1}{n} - K' \frac{G + f_1 (1 - p_1)^{m}}{n+m}
    \Rightarrow G \left( 1 + \frac{K'}{n+m} \right) = K \frac{f_1}{n} - K' \frac{f_1 (1 - p_1)^{m}}{n+m}
    \Rightarrow G = \frac{1}{1 + \frac{K'}{n+m}} \left( K \frac{f_1}{n} - K' \frac{f_1 \left( 1 - \frac{2 f_2}{n f_1} \right)^{m}}{n+m} \right)    (14)

□

All quantities apart from K and K' in Equation (2) are known.
The value of K can be estimated using the regression approach
introduced by Hwang and Shen. From the Cauchy-Schwarz
inequality we have that:

    K = \frac{\sum_{i=1}^{S} (1 - p_i)^{n}}{\sum_{i=1}^{S} p_i (1 - p_i)^{n-1}} \geq \frac{(n-1) f_1}{2 f_2}    (15)

This can be generalized to:

    K = \frac{n f_0}{f_1} \geq \frac{(n-1) f_1}{2 f_2} \geq \frac{(n-2) f_2}{3 f_3} \geq \cdots    (16)

Let g(i) = \frac{(n-i) f_i}{(i+1) f_{i+1}}. From the above we have that the function
g(x) is a smooth monotone function for all x > 0. Moreover, let y_i
denote a realization of g(i) mixed with a random error. Hwang and
Shen show how one can use an exponential regression model to estimate
K. The proposed model corresponds to:

    y_i = \beta_0 \exp(\beta_1 i^{\beta_2}) + \epsilon_i    (17)

where i = 1, ..., n - 1, \beta_0 > 0, \beta_1 < 0, \beta_2 > 0 and \epsilon_i denotes
random errors. It follows that K = \beta_0. To estimate the value of
K' for an increased sample of size n + m, we first show that K increases
monotonically as the size of the running sample increases.

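Lemma 1. The function K(n) = \frac{\sum_{i=1}^{S} (1 - p_i)^{n}}{\sum_{i=1}^{S} p_i (1 - p_i)^{n-1}} increases
monotonically, i.e., K(n + m) \geq K(n), \forall n, m > 0.

Proof. In the remainder of the proof we will denote K(n + m)
as K'. By definition we have that K = \frac{\sum_{i=1}^{S} (1 - p_i)^{n}}{\sum_{i=1}^{S} p_i (1 - p_i)^{n-1}} and
K' = \frac{\sum_{i=1}^{S} (1 - p_i)^{n+m}}{\sum_{i=1}^{S} p_i (1 - p_i)^{n+m-1}}. We want to show that:

    \frac{\sum_{i=1}^{S} (1 - p_i)^{n+m}}{\sum_{i=1}^{S} p_i (1 - p_i)^{n+m-1}} \geq \frac{\sum_{i=1}^{S} (1 - p_i)^{n}}{\sum_{i=1}^{S} p_i (1 - p_i)^{n-1}}

    \Leftrightarrow \sum_{i=1}^{S} (1 - p_i)^{n+m} \sum_{j=1}^{S} p_j (1 - p_j)^{n-1} - \sum_{i=1}^{S} p_i (1 - p_i)^{n+m-1} \sum_{j=1}^{S} (1 - p_j)^{n} \geq 0

    \Leftrightarrow \sum_{i<j} \left[ (1 - p_i)^{n+m} p_j (1 - p_j)^{n-1} - p_i (1 - p_i)^{n+m-1} (1 - p_j)^{n} \right.
                    \left. + (1 - p_j)^{n+m} p_i (1 - p_i)^{n-1} - p_j (1 - p_j)^{n+m-1} (1 - p_i)^{n} \right] \geq 0

    \Leftrightarrow \sum_{i<j} (1 - p_i)^{n-1} (1 - p_j)^{n-1} (p_j - p_i) \left( (1 - p_i)^{m} - (1 - p_j)^{m} \right) \geq 0    (18)

The last inequality always holds since each term of the summation
is non-negative. In particular, if p_j > p_i then also 1 - p_i > 1 - p_j,
and if p_j < p_i then 1 - p_i < 1 - p_j, so the two differences in each
term always have the same sign. □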
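Lemma 1 can be checked numerically on a small example. The Python sketch below evaluates K(n) directly from its definition; the popularity vector `p` is an arbitrary illustrative choice, not data from the experiments.

```python
def K(p, n):
    """K(n) = sum_i (1 - p_i)^n / sum_i p_i (1 - p_i)^(n - 1)."""
    numerator = sum((1.0 - pi) ** n for pi in p)
    denominator = sum(pi * (1.0 - pi) ** (n - 1) for pi in p)
    return numerator / denominator


# Hypothetical popularity distribution over four entities (sums to 1).
p = [0.5, 0.3, 0.15, 0.05]
values = [K(p, n) for n in range(1, 30)]
```

For this distribution the sequence `values` is non-decreasing in n, as the lemma states.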

Given the monotonicity of function K, we model K as a generalized
logistic function of the form f(x) = \frac{A}{1 + \exp(-G (x - D))}. As
we observe samples of different sizes for different queries, we estimate
K as described above and therefore observe different realizations
of f(·). Thus, we can learn the parameters of f and use
it to estimate K'. In the presence of an exclude list of size l we follow
the approach described in Section 3.2 to update the quantities
f_i and n used in the analysis above.
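Once estimates of K and K' are available, Equation (2) is a closed-form expression. The following Python sketch evaluates it; the function and parameter names are illustrative assumptions, and the values of K and K' are taken as inputs rather than fitted by regression.

```python
def direct_gain(n, m, f1, f2, K, K_prime):
    """Evaluate Equation (2) for a query of size m given a sample of size n."""
    if n <= 0 or f1 <= 0:
        return 0.0
    # Good-Turing popularity of a singleton entity, Equation (12)
    p1 = 2.0 * f2 / (n * f1)
    # Expected number of old singletons that remain singletons
    surviving = f1 * (1.0 - p1) ** m
    numerator = K * f1 / float(n) - K_prime * surviving / float(n + m)
    return numerator / (1.0 + K_prime / float(n + m))
```

By construction, the returned value G satisfies Equation (8) with f_1' = G + f_1 (1 - p_1)^m, which gives a quick consistency check for any choice of inputs.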

4. DISCOVERING QUERYING POLICIES 

Next, we focus on the second component of our proposed algorithmic
framework and introduce a multi-round adaptive optimization
algorithm for identifying querying strategies that maximize the
total gain across all rounds under the given budget constraints. We
build upon ideas from the multi-armed bandit literature. At
each round, the proposed algorithm uses as input the estimated gain
or return for different generalized queries q(k, E) at the different
nodes in H_D. Before presenting our framework we list several
challenges associated with this adaptive optimization problem.

• The first challenge is that the number of nodes in H_D is exponential
in the number of attributes A_D describing the domain of
interest. Querying every possible node to estimate its expected
return for different queries q(k, E) is prohibitively expensive.

That said, typical budgets do not allow algorithms to query all
nodes in the hierarchy, so this intractability may not hurt us
all that much. For example, we keep estimates for each of the
nodes for which at least one entity has been retrieved.

• The third challenge is balancing the tradeoff between exploitation
and exploration. The former refers to querying nodes for
which sufficient entities have been retrieved and hence we have
an accurate estimate for their expected return; the latter refers
to exploring new nodes in H_D to avoid locally optimal policies.




















4.1 Balancing Exploration and Exploitation 

While issuing queries q(k, E) at different nodes of H_D we obtain
a collection of entities that can be assigned to different nodes in
H_D. For each node we can estimate the return of a query q(k, E)
using the estimators presented in Section 3. However, this estimate
is based on a rather small sample of the underlying population.
Thus, exploiting this information at every round may lead to
suboptimal decisions. This is why one needs to balance
the trade-off between exploiting nodes for which the estimated return
is high and exploring nodes that have not been queried many times.
Formally, the latter corresponds to upper-bounding the expected return
of each potential action with a confidence interval that depends on
both the variance of the expected return and the number of times an
action has been evaluated.

Let r(a) denote the expected return of action a, which is an estimate
of the true return r*(a). Moreover, let σ(a) be an error component
on the return of action a chosen such that r(a) − σ(a) ≤
r*(a) ≤ r(a) + σ(a) with high probability. The parameter σ(a)
should take into account both the empirical variance of the expected
return and our uncertainty when an action or similar actions (e.g.,
queries with different k, E but at the same node) have been chosen
only a few times. Let n_{a,t} be the number of times we have chosen action
a by round t, and let v_{a,t} denote the maximum of some
constant c (e.g., c = 0.01) and the empirical variance for action a
at round t. The latter can be computed using bootstrapping over
the retrieved sample and applying the estimators presented in Section 3
over these bootstrapped samples. Several techniques have
been proposed in the multi-armed bandits literature to compute the
parameter σ(a). Teytaud et al. showed that techniques
considering both the variance and the number of times an action has
been chosen tend to outperform other proposed methods. Based on
this observation, we choose to use the following formula for σ(a):

    \sigma(a) = \sqrt{\frac{v_{a,t} \cdot \log(t)}{n_{a,t}}}    (19)
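A minimal Python sketch of Equation (19) follows. It assumes that the gain estimates obtained on bootstrapped samples are already available as a list; the function name and argument names are hypothetical.

```python
import math


def sigma(bootstrap_gains, times_chosen, round_t, c=0.01):
    """Confidence radius of Equation (19) for a single action."""
    count = float(len(bootstrap_gains))
    mean = sum(bootstrap_gains) / count
    variance = sum((g - mean) ** 2 for g in bootstrap_gains) / count
    v = max(c, variance)  # v_{a,t}: empirical variance floored at c
    # Assumes times_chosen >= 1 and round_t >= 1.
    return math.sqrt(v * math.log(round_t) / times_chosen)
```

The floor c keeps the radius positive for actions whose bootstrapped estimates happen to agree, so rarely chosen actions are still explored.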

4.2 A Multi-Round Querying Policy Algorithm 

We now introduce a multi-round algorithm for solving the budgeted
entity enumeration problem. At a high level, the algorithm
proceeds as follows: Instead of considering all potential queries
q(k, E) that can be issued at the different nodes of H_D, we consider
all potential query configurations (k, l). In particular, we do
not optimize directly for the exclude list to be used in a further
query but rather for its size l. Once we decide on l, the exclude
list E can be constructed following a randomized approach,
where l of the retrieved entities are included in the list uniformly
at random. The generated list can be used to update the frequency
counts f_i and sample size n and estimate the gain of the query.
Bootstrapping can also be used to obtain improved estimates.

We follow a randomized approach because a deterministic construction
of E that picks the l most popular items in the running sample
is very sensitive to the observed popularity distribution. When the
number of observed entities corresponds to a small portion of the
entire population, as in the scenarios we consider in this paper, the
individual entity popularity estimates tend to be very noisy. We empirically
observed that a deterministic construction of a limited-size
exclude list, especially during early queries, leads to poor popularity
estimates. Thus, we choose to follow a randomized approach.
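The randomized construction of the exclude list can be sketched as follows. The function name is hypothetical, and entities are assumed to be represented as strings.

```python
import random


def random_exclude_list(retrieved, l, rng=random):
    """Pick l distinct retrieved entities uniformly at random."""
    distinct = sorted(set(retrieved))
    # If fewer than l distinct entities are known, exclude all of them.
    return rng.sample(distinct, min(l, len(distinct)))
```

Passing a seeded `random.Random` instance as `rng` makes the construction reproducible across simulated runs.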

Let S denote the set of all potential query configurations (k, l)
that can be issued at the different nodes of H_D during a round.
Moreover, let r(a) + σ(a) and c(a) be the upper-bounded
return (i.e., gain) and cost for an action a ∈ S. At each round
the algorithm identifies an action in S that maximizes the quantity
(r(a) + σ(a)) / c(a) under the constraint that the cost of action a is less
than or equal to the remaining budget. Since we are operating under a
specified budget, one can view the problem at hand as a variation
of the typical knapsack problem. If no such action exists then the
algorithm terminates. Otherwise the algorithm issues the query
corresponding to action a and updates the set of unique entities obtained
from the queries, the remaining budget, and the set of potential
queries that can be executed in the next round. An overview
of this algorithm is shown in Algorithm 1.

As discussed before, the size of H_D is exponential in the values
of the attributes describing it, and thus, considering all the possible
queries for the different nodes of H_D can be prohibitively expensive.
Next, we discuss how one can initialize and update the set of
potential actions as the algorithm progresses, based on the structure of
the poset H_D and the retrieved entities from previous rounds.


Algorithm 1 Overall Algorithm

 1: Input: H_D: the hierarchy describing the entity domain; r, σ: value
    oracle access to gain upper bound; c: value oracle access to the query
    costs; β_c: query budget;
 2: Output: ℰ: a set of extracted distinct entities;
 3: ℰ ← {}
 4: RB ← β_c /* Initialize remaining budget */
 5: S ← UpdateActionSet(H_D)
 6: while RB > 0 and S ≠ {} do
 7:     a ← argmax_{a ∈ S} (r(a) + σ(a)) / c(a) such that RB − c(a) ≥ 0
 8:     if a is NULL then
 9:         break;
10:     RB ← RB − c(a) /* Update budget */
11:     Issue query corresponding to a
12:     E ← entities from query
13:     ℰ ← ℰ ∪ E /* Update unique entities */
14:     S ← UpdateActionSet(H_D)
15: return ℰ
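The control flow of Algorithm 1 can be sketched in Python as below. This is an illustration under stated assumptions, not the implementation evaluated in Section 5: the gain upper bound, cost, crowd query and action-set update are passed in as caller-supplied callables, and all names are hypothetical.

```python
def run_budgeted_extraction(budget, initial_actions, upper_bound, cost,
                            issue_query, update_actions):
    """Greedy budgeted loop; every callable is a caller-supplied stand-in."""
    extracted = set()
    remaining = budget
    actions = set(initial_actions)
    while remaining > 0 and actions:
        # Only actions that fit in the remaining budget are candidates.
        affordable = [a for a in actions if cost(a) <= remaining]
        if not affordable:
            break
        # Pick the action with the best upper-bounded gain per unit cost.
        best = max(affordable, key=lambda a: upper_bound(a) / float(cost(a)))
        remaining -= cost(best)
        extracted |= set(issue_query(best))
        actions = set(update_actions(actions, best))
    return extracted, remaining
```

Here `upper_bound(a)` plays the role of r(a) + σ(a), and `update_actions` plays the role of UpdateActionSet.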


4.3 Updating the Set of Actions 

Due to the exponential size of the poset H_D, we need to limit
the set of possible actions Algorithm 1 considers by exploiting the
structure of the given domain H_D. We propose an algorithm that
updates the set of actions by traversing the input poset in a top-down
manner and adds new actions that correspond to queries for
nodes that are direct descendants of already queried nodes. Due to
the hierarchical structure of the poset, nodes at higher levels of the
poset correspond to larger populations of entities. Therefore, issuing
queries at these nodes can potentially result in a larger number
of extracted entities. Also, traversing the poset in a top-down manner
allows one to detect sparsely populated areas of the poset.

Our approach for updating the set of available actions (Alg. 2)
proceeds as follows: If the set of available actions is empty, start
by considering all possible queries that can be issued at the root of
H_D (Ln. 4-5). The set of possible queries corresponds to queries
q(k, E) for all combinations of the values of parameters k and
l, which are pre-specified by the designer of the querying
interface. Recall that E is constructed in a randomized fashion once l
is determined. If the set of available actions is not
empty, we consider the node associated with the action selected
in the last round and populate the set of available actions with all
the queries corresponding to its direct descendants (Ln. 7-9), i.e.,
by traversing the input poset in a top-down fashion. As mentioned
above, the number of nodes in H_D can be prohibitively large;
therefore we also remove any bad actions from the running set of
actions (Ln. 10-14). An action a is bad when r(a) + σ(a) <
max_{a' ∈ S} (r(a') − σ(a')). Intuitively, this states that we do not
need to consider an action as long as there exists another action
such that the upper-bounded return of the former is lower than the
lower-bounded return of the latter. This is a standard technique
adopted in multi-armed bandits to limit the number of actions considered
by the algorithm.


Algorithm 2 UpdateActionSet

 1: Input: H_D: the hierarchy describing the entity domain; u: a node in
    H_D associated with the last selected action; S_old: the running set of
    actions; V_k: set of values for query parameter k; V_l: set of values for
    query parameter l;
 2: Output: S_new: the updated set of actions;
 3: /* Extend Set of Actions */
 4: if S_old is empty then
 5:     return {Root of H_D}
 6: S_new ← S_old
 7: for all d ∈ Set of Direct Descendant Nodes of u in H_D do
 8:     S_d ← Set of queries at d for all configurations in V_k × V_l
 9:     S_new ← S_new ∪ S_d
10: /* Remove Bad Actions */
11: /* Find maximum lower bound on gain over all actions in S_new */
12: thres ← max_{a' ∈ S_new} (r(a') − σ(a'))
13: B ← All actions a in S_new with r(a) + σ(a) < thres
14: S_new ← S_new \ B
15: return S_new
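A compact Python sketch of this update is given below. It assumes that an action is represented as a (node, configuration) pair and that the children of the last queried node are supplied by the caller; following the textual description, an empty action set is initialized with all query configurations at the root. All names are illustrative.

```python
def update_action_set(old_actions, children_of_last, configs, r, sigma, root):
    """Extend the action set with child-node queries, then prune bad actions."""
    if not old_actions:
        return {(root, cfg) for cfg in configs}
    new_actions = set(old_actions)
    for node in children_of_last:
        new_actions |= {(node, cfg) for cfg in configs}
    # Largest lower bound on the gain over all candidate actions
    threshold = max(r(a) - sigma(a) for a in new_actions)
    # Keep an action only if its upper bound reaches that lower bound
    return {a for a in new_actions if r(a) + sigma(a) >= threshold}
```

The pruning step never removes the action that attains the threshold, so the returned set is non-empty whenever the input set is.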


5. EXPERIMENTAL EVALUATION 

We present an empirical evaluation of our proposed algorith¬ 
mic framework using both real and synthetic datasets. First, we 
discuss the experimental methodology, then we describe the data 
and results that demonstrate the effectiveness of our framework on 
crowdsourced entity extraction. The evaluation is performed on an 
Intel(R) Core(TM) i7 3.7 GHz 32GB machine; all algorithms are 
implemented in Python 2.7. 

5.1 Experimental Setup 

Gain Estimators. We evaluate the following gain estimators: 

• Chao92Shen: This estimator combines the methodology proposed
by Chao for estimating the number of unseen species
with Shen’s formula, i.e., Equation (1).

• HwangShen: This estimator combines the regression-based approach
proposed by Hwang and Shen for estimating the
number of unseen species with Shen’s formula.

• NewRegr: This estimator corresponds to our new technique
proposed in Section 3.3.

All estimators were coupled with bootstrapping to estimate their
variance and retrieve an upper bound on the return of a query, as
shown in Section 4.1.

Entity Extraction Algorithms. We evaluate the following algo¬ 
rithms for crowdsourced entity extraction: 

• Rand: This algorithm executes random queries until all the
available budget is used. It selects a random node from the
input poset H_D and a random query configuration (k, l) from
a list of pre-specified k, l value combinations. We expect Rand
to be effective for extracting entities in small and dense data
domains that do not have many sparsely populated nodes.

• RandL: Same as Rand but executes queries only at the lowest
level nodes (i.e., leaf nodes) of the input poset H_D until all
the available budget is used. We expect RandL to be effective
for shallow data domains where the majority of nodes
are leaf nodes. Like Rand, the performance of RandL
is expected to be reasonable for small and dense data domains
without sparsely populated nodes.

• BFS: This algorithm performs a breadth-first traversal of the
input poset H_D, executing one query at each node. The query
configuration is randomly selected from a list of pre-specified
k, l value combinations. This algorithm promotes exploration
of the action space when extracting entities. It also takes into
account the structure of the input domain but is agnostic to
sparsely populated nodes of the input H_D.

• RootChao: This algorithm corresponds to the entity extraction
scheme of Trushkowsky et al. [36] that utilizes the Chao92Shen
estimator to measure the gain of an additional query. The proposed
scheme is agnostic to the structure of the input entity
domain, and thus equivalent to issuing queries only at the root
node of the poset H_D. Since the authors only propose a pay-as-you-go
scheme, we coupled this algorithm with Alg. 1 to optimize
for the input budget constraint. We allowed the algorithm
to consider different query configurations (k, l) but restricted
the possible queries to the root node.

• GSChao, GSHwang, GSNewR: These algorithms correspond
to our proposed querying policy algorithm (Section 4.2) coupled
with Chao92Shen, HwangShen and NewRegr, respectively.

• GSExact: This algorithm is used as a near-optimal, omniscient
baseline that allows us to see how far off our algorithms are
from an algorithm with perfect information. In particular, we
combine the algorithm proposed in Section 4.2 with an exact
computation of the return or gains from queries. More
precisely, the algorithm proceeds as follows: At each round
we speculatively execute each of the available actions (i.e., all
query configurations across all nodes) and select the one that
results in the largest return-to-cost ratio. Since the
return of each query is known, the algorithm is not coupled with
any of the aforementioned estimators.

Rand, RandL and BFS promote the exploration of the action
space when extracting entities, while the other algorithms balance
exploration with exploitation. For the results reported below, we
run each algorithm ten times and report the average gain achieved
under the given budget.

Querying Interface. For all datasets we consider generalized queries
of the type “Give me k more entities that satisfy certain conditions
and are not present in an exclude list of size l”. The conditions
correspond to matching the attribute values associated with a node
from the input poset. The configurations considered for (k, l) are
{(5, 0), (10, 0), (20, 0), (5, 2), (10, 5), (20, 5), (20, 10)}. Larger values
of k or l were deemed unreasonable for crowdsourced queries.
The gain of a query is computed as the number of new entities
extracted. The cost of each query is computed using an additive
model comprising three partial cost terms that depend on the
characteristics of the query.

The three partial cost terms are: (i) CostK, which depends on the
number of responses k requested from a user, (ii) CostL, which depends
on the size of the exclude list l used in the query, and (iii)
CostSpec, which depends on the specificity s of the query, e.g., we
assume that queries that require users to provide more specialized
entities (e.g., “Give me one concert for New York on the 17th of
Nov”) cost more than more generic queries (e.g., “Give me one
concert in New York”). More formally, we define the specificity
of a query to be equal to the number of attributes assigned non-wildcard
values for the node u ∈ H_D the query corresponds to.

Table 1: The population characteristics for the People’s domain.

News Portal       People
WSJ               594
WashPost          597
NY Times          595
HuffPost          599
USA Today         593

Person Type       People
Industry People   743
Athletes          743
Politicians       748
Actors/Singers    744

The overall cost for a query with configuration (k, l) and specificity
s is computed as:

    Cost(q) = \alpha \cdot \frac{k}{\text{max. query size}} + \beta \cdot \frac{l}{\text{max. ex. list size}} + \gamma \cdot \frac{s}{\text{max. specificity}}    (20)

The cost of a query should increase significantly when an
exclude list is used; thus we require that β is set to a larger value
than α and γ. For the results reported below, we set α = γ = 1
and β = 5. Similar results were observed for other settings.
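The additive cost model of Equation (20) can be sketched in Python as follows. The defaults for the maximum query size (20) and maximum exclude list size (10) are assumed to be the largest values among the configurations listed above; the maximum specificity of 3 is a purely hypothetical placeholder that depends on the number of attributes of the domain.

```python
def query_cost(k, l, s, max_k=20, max_l=10, max_s=3,
               alpha=1.0, beta=5.0, gamma=1.0):
    """Additive query cost of Equation (20) with alpha = gamma = 1, beta = 5."""
    return (alpha * k / float(max_k)
            + beta * l / float(max_l)
            + gamma * s / float(max_s))
```

With these defaults, a query for 5 entities at the root without an exclude list costs 0.25, while the most expensive configuration (k = 20, l = 10, s = 3) costs 7.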

Data. First, we evaluate the proposed framework on extracting entities
from a large sparse domain. We consider the event dataset
collected from Eventbrite. As described in Section 1, the poset
corresponding to the Eventbrite domain contains 8,508,160 nodes
with 57,805 distinct events overall. However, only 175,068 nodes
are populated, leading to a rather sparsely populated domain. Due
to the lack of popularity proxies for the extracted events, we assigned
a random popularity value in (0, 10] to each event. These weights
are used during sampling to form the actual popularity distribution
characterizing the population of each node in the poset.

We further evaluate the performance of the extraction algorithms
for a denser domain that we constructed ourselves. We used
Amazon’s Mechanical Turk to collect a real-world dataset
targeted at extracting “people in the news”. While different from the
event extraction domain studied before, this new domain is still
structured. We asked workers to extract the names of people belonging
to four different types from five different news portals.
The people types we considered are “Politicians”, “Athletes”,
“Actors/Singers” and “Industry People”. The news portals we considered
are “New York Times”, “Huffington Post”, “Washington
Post”, “USA Today” and “The Wall Street Journal”. This data domain,
referred to as the People’s domain, is essentially characterized
by the type of the individual and the news portal. Workers
were paid $0.20 per HIT. We issued 20 HITs for each leaf node of
the domain’s poset, resulting in 600 HITs in total. After manually
curating name misspellings, we extracted 1,245 unique people in
total. Table 1 shows the number of distinct entities for the different
values of the people-type and news portal attributes. Finally,
the popularity value of each extracted entity was set
equal to the number of times it appeared in the extraction result.
The values are normalized at sampling time to form a proper
popularity distribution. Collecting a large amount of data in advance
from Mechanical Turk and then simulating the responses of
human workers by revealing portions of this dataset allows us to
compare different algorithms on an equal footing; this approach is
often adopted in the evaluation of crowdsourcing algorithms [24, 36].

5.2 Experimental Results 

Next, we evaluate different aspects of the aforementioned extraction
techniques.

How does our querying policy algorithm compare against baselines?
We evaluate the performance of the different extraction algorithms
in terms of the number of entities extracted for different budgets.
The results for Eventbrite and the People’s domain are shown
in Figure 7(a) and Figure 7(b), respectively. As shown, our proposed
algorithms, i.e., GSChao, GSHwang and GSNewR, outperform
all baselines by at least 30% across both datasets. This behavior
is expected, as our techniques not only exploit the structure of the
domain to diversify entity extraction by targeting entities that belong
to the tail of the popularity distribution but also optimize the
queries for the given budget.

When comparing again the naive baselines Rand, RandL, and 
BFS, we see that GSChao, GSHwang and GSNewR extraet at least 
2X more entities for the sparse Eventbrite domain and around 100% 
more entities for small budgets and 54% for larger ones when con¬ 
sidering the dense People’s domain. For example for Eventibrite 
and a budget of $50 all schemes eoupled with our querying pol¬ 
icy discovery algorithm (Seetionjd]) extraeted more than 600 events 
while Rand and RandL extracted 1.1 and 0.2 events and BFS ex¬ 
tracted 207.7 events, an improvement of over 180%. 

Comparing against RootChao, we see that GSChao, GSHwang, and
GSNewR are able to retrieve up to 30% more entities for Eventbrite
and 5X more for the People’s domain. This performance difference
arises because the gain achieved by RootChao saturates faster than
that of GSChao, GSHwang, and GSNewR as the cost increases: RootChao
focuses on issuing queries at the root of the input poset, and hence
it is not able to extract entities belonging to the long tail of the
popularity distribution. Moreover, for the People’s domain, RootChao
performs poorly even compared to the naive baselines Rand, RandL,
and BFS. Again, this behavior is due to the skew of the underlying
popularity distribution.
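
The saturation effect described above can be illustrated with a toy
calculation. The sketch below is not the estimator or the policy used in
our experiments; it assumes, purely for illustration, that every answer is
an independent draw from a fixed popularity distribution, and it uses a
hypothetical domain with one very popular entity (a "head" node) and nine
rare ones (a "tail" node). It compares the expected number of distinct
entities obtained when the whole budget is spent at the root against a
policy that targets the two nodes separately.

```python
def expected_unique(probs, num_samples):
    """Expected number of distinct entities seen after `num_samples`
    independent draws (with replacement) from `probs`."""
    return sum(1.0 - (1.0 - p) ** num_samples for p in probs)


# Hypothetical toy domain: one very popular entity and nine rare ones.
head = [0.9]
tail = [0.1 / 9] * 9
root = head + tail  # popularity distribution seen by a root query

budget = 10  # total number of single-entity answers we can pay for

# Policy A: spend the whole budget at the root.
root_only = expected_unique(root, budget)

# Policy B: one answer from the "head" node, the rest from the "tail"
# node, where answers follow the renormalized tail popularity.
tail_norm = [p / sum(tail) for p in tail]
targeted = expected_unique([1.0], 1) + expected_unique(tail_norm, budget - 1)

# Diminishing returns at the root: a second batch of the same size
# adds fewer new entities than the first batch did.
first_batch = expected_unique(root, budget)
second_batch = expected_unique(root, 2 * budget) - first_batch

print(f"root only: {root_only:.2f}, targeted: {targeted:.2f}")
print(f"first batch: {first_batch:.2f}, second batch: {second_batch:.2f}")
```

Under these assumptions the root-only policy keeps re-extracting the popular
entity, while the targeted policy reaches most of the tail with the same
budget.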

Figure 7: A comparison of the proposed entity extraction techniques against
several baselines for (a) Eventbrite and (b) the People’s domain.

How do our techniques compare against a near-optimal policy
discovery algorithm? Next, we evaluate GSChao, GSHwang, and
GSNewR against the near-optimal querying policy discovery
algorithm GSExact. The results for Eventbrite and the People’s
domain are shown in Figure 8(a) and Figure 8(b), respectively. For
the sparse Eventbrite domain, we observe that for smaller budgets
our proposed techniques perform comparably to GSExact, which has
“perfect information” about the gain of each query, typically
demonstrating a performance gap of less than 10%. For larger
budgets this gap increases to 25%. Note that our estimators have
access to few samples and sparse information; the fact that we are
able to get this close to GSExact is notable. Finally, for the
People’s domain, our techniques present an increased performance gap
compared to GSExact. Nevertheless, the performance drop is at
most 50%.

How do the different techniques compare with respect to the
total number of queries issued during extraction? We compare
the performance of RootChao (i.e., the extraction scheme proposed
by Trushkowsky et al. [36]) against our algorithms GSChao,
GSHwang, and GSNewR with respect to the total number of queries
issued during extraction. Notice that this evaluation metric
directly characterizes the overall latency of the crowd-extraction
process. Figure 9 shows the corresponding results for a run for
Eventbrite and a budget of $80. As shown, RootChao requires up to
almost 3X more queries to extract the same number of entities as
our proposed techniques, thus exhibiting significantly larger
latency compared to GSChao, GSHwang, and GSNewR.

Figure 8: A comparison of the proposed entity extraction techniques against
a near-optimal algorithm for (a) Eventbrite and (b) the People’s domain.

Figure 9: The number of events extracted by different algorithms for the
Eventbrite data domain and the corresponding total number of queries.


How do our different algorithms traverse the poset and use
different query configurations? We next explore how our different
algorithms traverse the poset, and how they use different query
configurations. The results reported are averaged over ten runs and
correspond to the People’s domain. We begin by considering how
many queries these algorithms issue at various levels of the poset.
In Figure 10 we plot the number of queries issued at various levels
by our algorithms when the budget is set to 10 and 100,
respectively. Given a small budget, we observe that all algorithms
prefer issuing queries at higher levels of the poset. Notice that inner
nodes of the poset are preferred and only a small number of queries
is issued at the root (i.e., level one) of the poset. This behavior
is justified if we consider that, due to their popularity, certain
entities are repeatedly extracted, thus leading to a lower gain. As the
budget increases, we see that all algorithms tend to consider more
specialized queries at deeper levels of the poset. It is interesting to
observe that all of our algorithms issue the majority of their queries
at the level-two nodes, while GSExact, which has perfect
information, focuses mostly on the leaf nodes. Thus, in this case, our
techniques could benefit from being more aggressive at traversing
the poset and reaching deeper levels; overall, our techniques may
end up being more conservative in order to cater to a larger space
of posets and popularity distributions. In Figure 11 we plot the
different query configurations chosen by our algorithms when the
budget is set to 10 and 100, respectively. We observe that GSExact
always prefers queries with k = 20 and l = 0 for both small
and large budgets. On the other hand, our algorithms issue more
queries of smaller size when operating under a limited budget and
prefer queries of larger size for larger budgets. Out of all
algorithms, GSNewR was the only one issuing queries with
exclude lists of different sizes, thus exploiting the rich diversity of
query interfaces. However, the number of such queries is limited.

Table 2: Average absolute relative error for estimating the gain of different
queries for Eventbrite.

Q. Size k   EL. Size l   Chao92Shen   HwangShen   NewRegr
5           0            0.470        0.500       0.390
5           2            0.554        0.612       0.467
10          0            0.569        0.592       0.544
10          5            0.580        0.696       0.29
20          0            0.642        0.756       0.471
20          5            0.510        0.60        0.436
20          10           0.653        0.756       0.631

Figure 10: The number of queries issued at different levels when the budget
is set at 10 or 100.

Figure 11: The query configurations used when budget is set at 10 or 100.
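
To make the notion of a query configuration (k, l) concrete, the following
sketch simulates a single crowd query of size k with an exclude list of
size l over a toy node. Everything here is an assumption made for
illustration: the function names, the Zipf-like popularity weights, the
model of a worker who names entities in proportion to popularity, and the
uniform choice of the exclude list among already extracted entities (the
text only states that the exclude list is built with a randomized approach).

```python
import random


def simulate_query(popularity, k, exclude, rng):
    """Simulate one crowd query of size `k` with an exclude list.

    `popularity` maps entity -> positive weight. The simulated worker
    names up to `k` distinct entities, drawn without replacement in
    proportion to popularity, and never names an excluded entity.
    """
    excluded = set(exclude)
    candidates = {e: w for e, w in popularity.items() if e not in excluded}
    answer = []
    while candidates and len(answer) < k:
        entities = list(candidates)
        weights = [candidates[e] for e in entities]
        pick = rng.choices(entities, weights=weights, k=1)[0]
        answer.append(pick)
        del candidates[pick]
    return answer


def random_exclude_list(seen, size, rng):
    """Pick `size` already-extracted entities uniformly at random."""
    pool = sorted(seen)  # sort for reproducibility across runs
    return rng.sample(pool, min(size, len(pool)))


rng = random.Random(0)
# Hypothetical node with 50 events and Zipf-like popularity.
popularity = {f"event_{i}": 1.0 / (i + 1) for i in range(50)}

seen = set(simulate_query(popularity, k=10, exclude=[], rng=rng))
exclude = random_exclude_list(seen, size=5, rng=rng)
answer = simulate_query(popularity, k=10, exclude=exclude, rng=rng)
gain = len(set(answer) - seen)  # newly extracted entities
print(f"query (k=10, l=5) returned {gain} new entities")
```

In this toy model, a larger exclude list removes popular entities from the
candidate pool and therefore tends to increase the gain of a query.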

How effective are the different estimators at predicting the gain
of additional queries? Finally, we point out that GSNewR was
able to outperform GSChao and GSHwang for Eventbrite, but the
opposite behavior was observed for the People’s domain. To further
understand the relative performance of GSChao, GSHwang, and
GSNewR, we evaluate the performance of the gain estimators
Chao92Shen, HwangShen, and NewRegr at predicting the number of
newly retrieved events for different query configurations. For
Eventbrite, we choose ten random nodes containing more than 5,000
events, and for each of them and each of the available query
parameter configurations (k, l), we execute ten queries of the form
“Give me k items from node u ∈ H_D that are not included in an
exclude list of size l.” As mentioned in Section 3.2, the exclude
list for each query is constructed following a randomized approach.
For the People’s domain, we issue ten queries over all nodes of the
input poset for all available query configurations. We measure the
performance of each estimator by considering the absolute relative
error between the predicted return and the actual return of the query.
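
The error metric used above can be sketched as follows. The function names
are ours, and the handling of queries whose actual return is zero (for which
the relative error is undefined) is an assumption, since the text does not
specify it; here such a query raises an error.

```python
def absolute_relative_error(predicted: float, actual: float) -> float:
    """|predicted - actual| / |actual| for a single query."""
    if actual == 0:
        # Assumption: undefined when the query returned nothing new.
        raise ValueError("relative error is undefined for actual == 0")
    return abs(predicted - actual) / abs(actual)


def mean_absolute_relative_error(pairs) -> float:
    """Average the error over (predicted, actual) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (predicted, actual) pair")
    total = sum(absolute_relative_error(p, a) for p, a in pairs)
    return total / len(pairs)


# Hypothetical predictions for two queries that each returned 5 new events.
example = [(4.0, 5.0), (6.0, 5.0)]
print(mean_absolute_relative_error(example))
```

Each hypothetical prediction above is off by one event out of five, so the
average error is 0.2.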

Table 2 reports the relative error for each of the three estimators,
averaged over all points under consideration for Eventbrite.
As shown, all three estimators perform comparably, with the new
regression-based technique slightly outperforming Chao92Shen and
HwangShen for certain types of queries. For example, for k = 10,
l = 5, Chao92Shen has a relative error of 0.58, HwangShen has a
relative error of 0.7, and NewRegr has a relative error of 0.29.
We attribute the improved extraction performance of GSNewR to
these improved estimates. The relatively large values of the
relative errors are expected, as the retrieved samples correspond to
a very small portion of the underlying population for each of the
points. This is a well-known behavior of non-parametric estimators
that has been studied extensively in the species estimation
literature.

Table 3 shows the results for the People’s domain. We observe
that for smaller query sizes the regression technique proposed in
this paper offers better gain estimates. However, as the query size
increases, and hence a larger portion of the underlying population
is observed, Chao92Shen outperforms both regression-based
techniques. Thus, we are able to explain the performance difference
between GSChao and the other two algorithms. In summary, for
sparse domains regression-based techniques result in better
performance, whereas for dense domains the Chao92Shen estimator
results in better performance, as a larger portion of the
underlying population can be sampled.
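
To illustrate why coverage-based estimators improve as more of the
population is sampled, the sketch below implements the textbook Chao92
population-size estimator combined with the formula of Shen et al. for the
expected number of new entities in m further answers, as we recall them
from the species estimation literature. It is an assumption that this
matches the Chao92Shen variant of Section 3, which may differ (for example
in how query sizes and exclude lists are handled); the function name and
the example sample are ours.

```python
from collections import Counter


def chao92_shen_gain(sample, m):
    """Estimate how many unseen entities `m` further answers would reveal.

    `sample` is the list of entities observed so far (with repetitions).
    Returns (estimated population size, expected number of new entities).
    """
    n = len(sample)
    counts = Counter(sample)        # entity -> times observed
    c = len(counts)                 # distinct entities seen
    f = Counter(counts.values())    # f[i] = number of entities seen i times
    f1 = f.get(1, 0)
    if n < 2 or f1 == n:
        raise ValueError("need repeated observations to estimate coverage")

    coverage = 1.0 - f1 / n         # Good-Turing sample coverage
    base = c / coverage
    # Estimated squared coefficient of variation of entity popularity.
    spread = sum(i * (i - 1) * fi for i, fi in f.items())
    cv2 = max(base * spread / (n * (n - 1)) - 1.0, 0.0)
    n_hat = base + n * (1.0 - coverage) / coverage * cv2  # Chao92

    f0 = n_hat - c                  # estimated number of unseen entities
    if f0 <= 0:
        return n_hat, 0.0
    # Shen et al.: an answer hits a given unseen entity w.p. (1 - C) / f0.
    per_draw = min((1.0 - coverage) / f0, 1.0)
    new = f0 * (1.0 - (1.0 - per_draw) ** m)
    return n_hat, new


# Hypothetical sample: 7 answers naming 4 distinct entities.
sample = ["a", "a", "b", "c", "c", "c", "d"]
n_hat, new_5 = chao92_shen_gain(sample, m=5)
_, new_10 = chao92_shen_gain(sample, m=10)
print(f"estimated population: {n_hat:.2f}, expected new in 5: {new_5:.2f}")
```

When only a handful of answers have been observed, almost every entity is a
singleton and the coverage estimate is close to zero, which makes the
estimate unstable; this is in line with the large errors reported for the
sparse domain.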

Table 3: Average absolute percentage error for estimating the gain of
different queries for the People’s data domain.

Q. Size k   EL. Size l   Chao92Shen   HwangShen   NewRegr
5           0            0.295        0.299       0.228
5           2            0.163        0.156       0.144
10          0            0.306        0.305       0.277
10          5            0.341        0.349       0.293
20          0            0.359        0.371       0.467
20          5            0.2615       0.264       0.249
20          10           0.1721       0.162       0.127


6. RELATED WORK

The prior work related to the techniques proposed in this paper 
can be placed in a few categories; we describe each of them in turn: 

Crowd Algorithms. There has been a significant amount of work
on designing algorithms where the unit operations (e.g.,
comparisons, predicate evaluations, and so on) are performed by human
workers, including common database primitives such as filter,
join [20], and max, machine learning primitives such as entity
resolution and clustering [28], as well as data mining primitives.

Previous work on the task of crowdsourced extraction or
enumeration, i.e., populating a database with entities using the
crowd [26, 36], is the most related to ours. In both cases, the
focus is on a single entity extraction query; extracting entities
from large and diverse data domains is not considered. Moreover, the
proposed techniques do not support dynamic adaptation of the queries
issued against the crowd to optimize for a specified monetary budget.

Knowledge Acquisition Systems. Recent work has also considered
the problem of using crowdsourcing within knowledge acquisition
systems [39]. This line of work suggests using the
crowd for curating knowledge bases (e.g., assessing the validity of
the extracted facts) and for gathering additional information to be
added to the knowledge base (e.g., missing attributes of an entity
or relationships between entities), instead of augmenting the set of
entities themselves. As a result, these papers are solving an
orthogonal problem. The techniques described in this paper for
estimating the amount of information from a query and devising
querying strategies to maximize the amount of extracted information
will surely be beneficial for knowledge extraction systems as well.

Deep Web Crawling. A different line of work has focused on data
extraction from the deep web [17, 32]. In such scenarios, data is
obtained by querying a form-based interface over a hidden database
and extracting results from the resulting dynamically generated
answer (often a list of entities). Typically, such interfaces provide
a partial list of matching entities to issued queries; the list is
usually limited to the top-k tuples based on an unknown ranking
function. Sheng et al. [32] provide near-optimal algorithms that
exploit the exposed structure of the underlying domain to extract all
the tuples present in the hidden database under consideration. Our
work is similar to this work in that our goal is to also extract
entities via a collection of interfaces (in our case the interfaces
correspond to queries asked to the crowd).

The main difference between this line of work and ours is that
answers from a hidden database are deterministic, i.e., a query in
their setting will always retrieve the same top-k tuples. This
assumption does not hold in the crowdsourcing scenario considered in
this paper, and thus the proposed techniques are not applicable. In
their setting, it suffices to ask each query precisely once. In our
setting, since crowdsourced entity extraction queries can be viewed
as random samples from an unknown distribution, one needs to make use
of the query result estimation techniques introduced in Section 3.
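
The contrast between the two settings can be sketched with a small
simulation. Both interfaces below are hypothetical stand-ins: the hidden
database always returns the same top-k prefix of a fixed ranking, while the
crowd is modeled, as a simplifying assumption, as returning k distinct
entities chosen uniformly at random (real crowd answers are skewed by
popularity).

```python
import random


def hidden_db_top_k(ranked_entities, k):
    """Deterministic interface: always the same top-k prefix."""
    return ranked_entities[:k]


def crowd_query(entities, k, rng):
    """Randomized interface: k distinct entities chosen at random."""
    return rng.sample(entities, k)


entities = [f"e{i}" for i in range(100)]
rng = random.Random(7)

db_seen = set()
crowd_seen = set()
for _ in range(5):  # repeat the same query five times
    db_seen.update(hidden_db_top_k(entities, 10))
    crowd_seen.update(crowd_query(entities, 10, rng))

print(f"hidden database: {len(db_seen)} distinct entities")
print(f"crowd: {len(crowd_seen)} distinct entities")
```

Repeating the deterministic query never yields anything beyond the first
ten entities, whereas repeating the randomized query can keep yielding new
ones, which is why the gain of a repeated query has to be estimated.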

7. CONCLUSIONS AND FUTURE WORK 

In this paper, we studied the problem of crowdsourced entity
extraction over large and diverse data domains. We introduced a novel
crowdsourced entity extraction framework that combines statistical
techniques with an adaptive optimization algorithm to maximize
the total number of unique entities extracted. We proposed a new
regression-based technique for estimating the gain of further
querying when the number of retrieved entities is small with respect
to the total size of the underlying population. We also introduced a
new algorithm that exploits the often known structure of the
underlying data domain to devise adaptive querying strategies. Our
experimental results show that our techniques extract up to 4X more
entities compared to a collection of baselines, and for large sparse
entity domains are at most 25% away from an omniscient adaptive
querying strategy with perfect information.

Some of the future directions for extending this work include
reasoning about the quality and correctness of the extracted result
as well as extending the proposed techniques to other types of
information extraction tasks. As mentioned before, the techniques
proposed in this paper do not deal with incomplete and imprecise
information. However, there has been an increasing amount of
literature on addressing these quality issues in crowdsourcing
[13, 22, 33]. Combining these techniques, or entity resolution
techniques that reason about similarity of extracted entities, with
our proposed framework is a promising future direction. Finally, it
is of particular interest to consider how the proposed framework can
be applied to other budget-sensitive information extraction
applications, including discovering valuable data sources for
integration tasks or curating and completing a knowledge base.